6 research outputs found
Deformable Part Models for Automatically Georeferencing Historical Map Images
Libraries are digitizing their collections of maps from all eras, generating increasingly large online collections of historical cartographic resources. Aligning such maps to a modern geographic coordinate system greatly increases their utility. This work presents a method for such automatic georeferencing, matching raster image content to GIS vector coordinate data. Given an approximate initial alignment that has already been projected from a spherical geographic coordinate system to a Cartesian map coordinate system, a probabilistic shape-matching scheme determines an optimized match between the GIS contours and ink in the binarized map image. Using an evaluation set of 20 historical maps from states and regions of the U.S., the method reduces average alignment RMSE by 12%
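The evaluation metric named in the abstract, alignment RMSE over matched points, can be sketched as follows. This is an illustrative sketch, not the paper's code; the function name and the (x, y)-pair point format are assumptions.

```python
import math

def alignment_rmse(predicted, ground_truth):
    """Root-mean-square error between matched control points.

    `predicted` and `ground_truth` are parallel lists of (x, y) pairs in the
    Cartesian map coordinate system; the RMSE is in the same units.
    (Illustrative sketch -- the paper's exact evaluation protocol may differ.)
    """
    squared = [(px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(predicted, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))
```

Under this metric, the reported result means the optimized alignment's RMSE is on average 12% lower than that of the initial approximate alignment.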
Cryptographic Hardness Under Projections for Time-Bounded Kolmogorov Complexity
A version of time-bounded Kolmogorov complexity, denoted KT, has received attention in the past several years, due to its close connection to circuit complexity and to the Minimum Circuit Size Problem (MCSP). Essentially all results about the complexity of MCSP hold also for MKTP (the problem of computing the KT complexity of a string). Both MKTP and MCSP are hard for SZK (Statistical Zero Knowledge) under BPP-Turing reductions; neither is known to be NP-complete.
Recently, some hardness results for MKTP were proved that are not (yet) known to hold for MCSP. In particular, MKTP is hard for DET (a subclass of P) under nonuniform ≤^{NC^0}_m reductions. In this paper, we improve this, to show that the complement of MKTP is hard for the (apparently larger) class NISZK_L under not only ≤^{NC^0}_m reductions but even under projections. Also, the complement of MKTP is hard for NISZK under ≤^{P/poly}_m reductions. Here, NISZK is the class of problems with non-interactive zero-knowledge proofs, and NISZK_L is the non-interactive version of the class SZK_L that was studied by Dvir et al.
As an application, we provide several improved worst-case to average-case reductions to problems in NP, and we obtain a new lower bound on MKTP (which is currently not known to hold for MCSP).
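For readers unfamiliar with the measure: a standard formulation of KT in the literature is, roughly,

```latex
\mathrm{KT}(x) \;=\; \min\Bigl\{\, |d| + t \;:\; \forall\, i \le |x|+1,\ \forall\, b \in \{0,1,*\},\
U^{d}(i,b) \text{ accepts in at most } t \text{ steps iff } x_i = b \,\Bigr\}
```

where U is a fixed universal machine given oracle access to the description d, and x_i is taken to be * for i = |x|+1 (marking the end of the string). This is stated roughly; consult the literature for the precise conventions.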
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
Over the past few years, Large Language Models of Code (Code LLMs) have
started to have a significant impact on programming practice. Code LLMs are
also emerging as a building block for research in programming languages and
software engineering. However, the quality of code produced by a Code LLM
varies significantly by programming languages. Code LLMs produce impressive
results on programming languages that are well represented in their training
data (e.g., Java, Python, or JavaScript), but struggle with low-resource
languages, like OCaml and Racket.
This paper presents an effective approach for boosting the performance of
Code LLMs on low-resource languages using semi-synthetic data. Our approach
generates high-quality datasets for low-resource languages, which can then be
used to fine-tune any pretrained Code LLM. Our approach, called MultiPL-T,
translates training data from high-resource languages into training data for
low-resource languages. We apply our approach to generate tens of thousands of
new, validated training items for Racket, OCaml, and Lua from Python. Moreover,
we use an open dataset (The Stack) and model (StarCoderBase), which allow us to
decontaminate benchmarks and train models on this data without violating the
model license.
With MultiPL-T generated data, we present fine-tuned versions of
StarCoderBase that achieve state-of-the-art performance for Racket, OCaml, and
Lua on benchmark problems. For Lua, our fine-tuned model achieves the same
performance as StarCoderBase achieves on Python -- a very high-resource language -- on
the MultiPL-E benchmarks. For Racket and OCaml, we double their performance on
MultiPL-E, bringing their performance close to higher-resource languages such
as Ruby and C#.
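The pipeline the abstract describes, translating items from a high-resource language and keeping only translations whose (also translated) unit tests pass, can be sketched as below. This is a runnable sketch, not the MultiPL-T code: `translate` is a placeholder for the LLM-backed translator, and Python stands in for the low-resource target language so the validation step actually executes here; in MultiPL-T the targets are Racket, OCaml, and Lua.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Execute a candidate translation together with its translated unit
    tests; the item is kept only if every assertion passes.  (Python stands
    in for the low-resource target language so the sketch is runnable.)"""
    scope: dict = {}
    try:
        exec(candidate_src, scope)   # define the translated function
        exec(test_src, scope)        # run the translated assertions
        return True
    except Exception:
        return False

def build_semi_synthetic_dataset(items, translate):
    """`items` is a list of (function_source, test_source) pairs in the
    high-resource language; `translate` is any str -> str function standing
    in for the model-based translator."""
    kept = []
    for fn_src, test_src in items:
        fn_t, tests_t = translate(fn_src), translate(test_src)
        if passes_tests(fn_t, tests_t):
            kept.append((fn_t, tests_t))
    return kept
```

The test-based filter is what makes the data "semi-synthetic" yet validated: an incorrect translation fails its own translated tests and is silently dropped rather than polluting the fine-tuning set.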
A Scalable and Extensible Approach to Benchmarking NL2Code for 18 Programming Languages
Large language models have demonstrated the ability to condition on and
generate both natural language and programming language text. Such models open
up the possibility of multi-language code generation: could code generation
models generalize knowledge from one language to another? Although contemporary
code generation models can generate semantically correct Python code, little is
known about their abilities with other languages. We facilitate the exploration
of this topic by proposing MultiPL-E, the first multi-language parallel
benchmark for natural-language-to-code generation.
MultiPL-E extends the HumanEval benchmark (Chen et al., 2021) to support 18
more programming languages, encompassing a range of programming paradigms and
popularity. We evaluate two state-of-the-art code generation models on
MultiPL-E: Codex and InCoder. We find that on several languages, Codex matches
and even exceeds its performance on Python. The range of programming languages
represented in MultiPL-E allows us to explore the impact of language frequency
and language features on model performance. Finally, the MultiPL-E approach of
compiling code generation benchmarks to new programming languages is both
scalable and extensible. We describe a general approach for easily adding
support for new benchmarks and languages to MultiPL-E.
MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation
Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks to 18 additional programming languages. We use MultiPL-E to extend the HumanEval benchmark (Chen et al., 2021) and MBPP benchmark (Austin et al., 2021) to 18 languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022) and InCoder (Fried et al., 2022). We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
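The core of compiling a unit test-driven benchmark to a new language is rendering the Python test values and assertions in the target language's syntax. The sketch below illustrates the idea for Lua as the target; it is not MultiPL-E's actual implementation, and the function names are invented for this example.

```python
def to_lua(value) -> str:
    """Render a Python test value as a Lua literal.  bool is checked before
    int because bool is a subclass of int in Python."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if value is None:
        return "nil"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    if isinstance(value, list):
        return "{" + ", ".join(to_lua(v) for v in value) + "}"
    raise TypeError(f"no Lua rendering for {type(value).__name__}")

def lua_assertion(fn_name: str, args, expected) -> str:
    """Compile one `assert fn(*args) == expected` test case to a Lua assert."""
    call = f"{fn_name}({', '.join(to_lua(a) for a in args)})"
    return f"assert({call} == {to_lua(expected)})"
```

A caveat that shows why a real compiler of benchmarks is more involved: Lua's `==` compares tables by identity, so a list-valued `expected` would need a deep-equality helper in the emitted test harness; this sketch is only faithful for scalar expected values.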